
    HALO: Post-Link Heap-Layout Optimisation

    Today, general-purpose memory allocators dominate the landscape of dynamic memory management. While these solutions can provide reasonably good behaviour across a wide range of workloads, it is an unfortunate reality that their behaviour for any particular workload can be highly suboptimal. By catering primarily to average and worst-case usage patterns, these allocators deny programs the advantages of domain-specific optimisations, and thus may inadvertently place data in a manner that hinders performance, generating unnecessary cache misses and load stalls. To help alleviate these issues, we propose HALO: a post-link profile-guided optimisation tool that can improve the layout of heap data to reduce cache misses automatically. Profiling the target binary to understand how allocations made in different contexts are related, we specialise memory-management routines to allocate groups of related objects from separate pools to increase their spatial locality. Unlike other solutions of its kind, HALO employs novel grouping and identification algorithms which allow it to create tight-knit allocation groups using the entire call stack and to identify these efficiently at runtime. Evaluation of HALO on contemporary out-of-order hardware demonstrates speedups of up to 28% over jemalloc, outperforming a state-of-the-art data placement technique from the literature.
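    As a rough illustration of the grouping idea described above (a sketch, not HALO's actual implementation), the snippet below hashes the allocation call stack and routes contexts that an assumed offline profile has grouped together to a common pool; all names here (CONTEXT_TO_POOL, grouped_alloc) are hypothetical.

    # Illustrative sketch only: place allocations whose call-stack context was
    # grouped together by a (hypothetical) profiling phase into the same pool,
    # so related objects end up spatially close to one another.
    import traceback

    CONTEXT_TO_POOL = {}          # hypothetical: filled by the offline profile
    POOLS = {}                    # pool id -> bytearray acting as a bump arena

    def context_hash():
        # Identify the allocation site by its full call stack.
        frames = traceback.extract_stack()[:-2]
        return hash(tuple((f.filename, f.lineno) for f in frames))

    def grouped_alloc(nbytes):
        pool_id = CONTEXT_TO_POOL.get(context_hash(), "default")
        arena = POOLS.setdefault(pool_id, bytearray())
        offset = len(arena)
        arena.extend(b"\x00" * nbytes)    # "allocate" by extending the arena
        return pool_id, offset            # handle to the placed object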

    An efficient task-based all-reduce for machine learning applications

    All-Reduce is a collective-combine operation frequently utilised in synchronous parameter updates in parallel machine learning algorithms. The performance of this operation - and subsequently of the algorithm itself - is heavily dependent on its implementation, configuration and on the supporting hardware on which it is run. Given the pivotal role of all-reduce, a failure in any of these regards will significantly impact the resulting scientific output. In this research we explore the performance of alternative all-reduce algorithms in data-flow graphs and compare these to the commonly used reduce-broadcast approach. We present an architecture and interface for all-reduce in task-based frameworks, and a parallelization scheme for object serialization and computation. We present a concrete, novel application of a butterfly all-reduce algorithm on the Apache Spark framework on a high-performance compute cluster, and demonstrate the effectiveness of the new butterfly algorithm with a logarithmic speed-up with respect to the vector length compared with the original reduce-broadcast method - a 9x speed-up is observed for vector lengths on the order of 10^8. This improvement comprises both algorithmic changes (65%) and parallel-processing optimization (35%). The effectiveness of the new butterfly all-reduce is demonstrated using real-world neural network applications with the Spark framework. For the model-update operation we observe significant speed-ups using the new butterfly algorithm compared with the original reduce-broadcast, for both smaller (CIFAR and MNIST) and larger (ImageNet) datasets.
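    For readers unfamiliar with the communication pattern, here is a minimal simulation of a butterfly (recursive-doubling) all-reduce over a power-of-two number of workers. It shows only the exchange-and-sum structure; it is not the paper's Spark-based, task-parallel implementation.

    # Butterfly all-reduce simulated in-process: at step k, "process" p combines
    # with partner p XOR k; after log2(P) steps every process holds the full sum.
    import numpy as np

    def butterfly_allreduce(vectors):
        procs = len(vectors)                 # assumed to be a power of two
        data = [np.asarray(v, dtype=float) for v in vectors]
        step = 1
        while step < procs:
            data = [data[p] + data[p ^ step] for p in range(procs)]
            step *= 2
        return data                          # every process now holds the sum

    # Example: four workers each contribute a gradient vector.
    print(butterfly_allreduce([[1, 2], [3, 4], [5, 6], [7, 8]])[0])  # [16. 20.]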

    Genuinely Distributed Byzantine Machine Learning

    Machine Learning (ML) solutions are nowadays distributed, according to the so-called server/worker architecture. One server holds the model parameters while several workers train the model. Clearly, such an architecture is prone to various types of component failures, which can all be encompassed within the spectrum of Byzantine behavior. Several approaches have been proposed recently to tolerate Byzantine workers. Yet all require trusting a central parameter server. We initiate in this paper the study of the "general" Byzantine-resilient distributed machine learning problem where no individual component is trusted. We show that this problem can be solved in an asynchronous system, despite the presence of 1/3 Byzantine parameter servers and 1/3 Byzantine workers (which is optimal). We present a new algorithm, ByzSGD, which solves the general Byzantine-resilient distributed machine learning problem by relying on three major schemes. The first, Scatter/Gather, is a communication scheme whose goal is to bound the maximum drift among models on correct servers. The second, Distributed Median Contraction (DMC), leverages the geometric properties of the median in high-dimensional spaces to bring parameters within the correct servers back close to each other, ensuring learning convergence. The third, Minimum-Diameter Averaging (MDA), is a statistically robust gradient aggregation rule whose goal is to tolerate Byzantine workers. MDA requires a loose bound on the variance of non-Byzantine gradient estimates, compared to existing alternatives (e.g., Krum). Interestingly, ByzSGD ensures Byzantine resilience without adding communication rounds (on a normal path), compared to vanilla non-Byzantine alternatives. ByzSGD requires, however, a larger number of messages which, we show, can be reduced if we assume synchrony. Comment: this is a merge of arXiv:1905.03853 and arXiv:1911.07537; arXiv:1911.07537 will be retracted.
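    As a small, hedged sketch of the Minimum-Diameter Averaging rule as the abstract describes it: average the (n - f)-subset of worker gradients whose maximum pairwise distance (diameter) is smallest. The brute-force subset search below is for illustration only and is not ByzSGD's code.

    # Minimum-Diameter Averaging (MDA) sketch: tolerate up to f Byzantine
    # gradients out of n by averaging the tightest (n - f)-subset.
    from itertools import combinations
    import numpy as np

    def mda(gradients, f):
        grads = [np.asarray(g, dtype=float) for g in gradients]
        n = len(grads)

        def diameter(idx):
            return max(np.linalg.norm(grads[i] - grads[j])
                       for i, j in combinations(idx, 2))

        best = min(combinations(range(n), n - f), key=diameter)
        return np.mean([grads[i] for i in best], axis=0)

    # Example: four honest gradients near [1, 1] plus one Byzantine outlier.
    print(mda([[1, 1], [1.1, 0.9], [0.9, 1.0], [1.0, 1.1], [100, -100]], f=1))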

    SparCML: High-Performance Sparse Communication for Machine Learning

    Applying machine learning techniques to the quickly growing data in science and industry requires highly scalable algorithms. Large datasets are most commonly processed in a "data-parallel" fashion, distributed across many nodes. Each node's contribution to the overall gradient is summed using a global allreduce. This allreduce is the single communication, and thus scalability, bottleneck for most machine learning workloads. We observe that frequently, many gradient values are (close to) zero, leading to sparse or sparsifiable communication. To exploit this insight, we analyze, design, and implement a set of communication-efficient protocols for sparse input data, in conjunction with efficient machine learning algorithms which can leverage these primitives. Our communication protocols generalize standard collective operations by allowing processes to contribute arbitrary sparse input data vectors. Our generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations. As such, SparCML and its techniques will form the basis of future highly scalable machine learning frameworks.
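    Conceptually, a sparse allreduce sums per-process contributions expressed as (index, value) pairs rather than dense vectors, so communication volume tracks the number of non-zeros. The toy model below captures that idea in plain Python; it is not SparCML's MPI-based implementation.

    # Toy sparse allreduce: combine per-process sparse gradients (dicts) by
    # summing values at matching indices; the result stays sparse whenever the
    # union of contributed indices is small.
    from collections import defaultdict

    def sparse_allreduce(contributions):
        """contributions: one {index: value} dict per process."""
        total = defaultdict(float)
        for sparse_vec in contributions:      # stands in for the real exchange
            for idx, val in sparse_vec.items():
                total[idx] += val
        return dict(total)                    # every process receives this

    # Three workers with mostly-zero gradients; only indices 3, 7 and 42 matter.
    print(sparse_allreduce([{3: 0.5, 42: 1.0}, {7: -0.2, 42: 2.0}, {3: 0.1}]))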

    Quantifying the effectiveness of testing via efficient residual path profiling

    Software testing is extensively used for uncovering bugs in large, complex software. Testing relies on well-designed regression test suites that anticipate all reasonable software usage scenarios. Unfortunately, testers today have no way of knowing how much of real-world software usage was untested by their regression suite. Recent advances in low-overhead path profiling provide the opportunity to rectify this deficiency and perform residual path profiling on deployed software. Residual path profiling identifies all paths executed by deployed software that were untested during software development. We extend prior research to perform low-overhead interprocedural path profiling. We demonstrate experimentally that low-overhead path profiling, both intraprocedural and interprocedural, provides valuable quantitative information on testing effectiveness. We also show that residual edge profiling is inadequate, as a significant number of untested paths include no new untested edges.
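    The following toy example (with made-up path and edge names) illustrates the distinction the abstract draws between residual path profiling and residual edge profiling: a deployed path can be entirely new even though every one of its edges was already covered by the test suite.

    # Paths are modelled as tuples of basic-block ids; names are invented.
    tested_paths   = {("a", "b", "d", "e"), ("a", "c", "d", "f")}
    deployed_paths = {("a", "b", "d", "e"), ("a", "b", "d", "f")}

    residual_paths = deployed_paths - tested_paths
    print(residual_paths)                    # {('a', 'b', 'd', 'f')}: untested path

    def edges(path):
        return set(zip(path, path[1:]))

    tested_edges = set().union(*(edges(p) for p in tested_paths))
    print(edges(("a", "b", "d", "f")) - tested_edges)   # set(): no new edges,
    # so edge-level residual profiling would miss this untested path.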

    HeapMD: Identifying Heap-based Bugs using Anomaly Detection

    We present the design, implementation, and evaluation of HeapMD, a dynamic analysis tool that finds heap-based bugs using anomaly detection. HeapMD is based upon the observation that, in spite of the evolving nature of the heap, several of its properties remain stable. HeapMD uses this observation in a novel way: periodically, during the execution of the program, it computes a suite of metrics which are sensitive to the state of the heap. These metrics track heap behavior, and the stability of the heap is reflected quantitatively in the values of these metrics. The “normal” ranges of stable metrics, obtained by running a program on multiple inputs, are then treated as indicators of correct behaviour, and are used in conjunction with an anomaly detector to find heap-based bugs. Using HeapMD, we were able to find 40 heap-based bugs, 31 of them previously unknown, in 5 large, commercial applications.
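    As a hedged sketch of the approach the abstract describes, with invented metrics and ranges: compute simple metrics over the heap's pointer graph and flag executions whose values fall outside the “normal” ranges observed on clean training runs.

    # Illustration only: the metrics and thresholds below are hypothetical.
    def heap_metrics(heap_graph):
        """heap_graph: dict mapping object id -> list of pointed-to object ids."""
        nodes = len(heap_graph) or 1
        leaves = sum(1 for targets in heap_graph.values() if not targets)
        out_degree = sum(len(t) for t in heap_graph.values()) / nodes
        return {"leaf_fraction": leaves / nodes, "avg_out_degree": out_degree}

    NORMAL_RANGES = {"leaf_fraction": (0.2, 0.6), "avg_out_degree": (0.5, 3.0)}

    def check(heap_graph):
        anomalies = {}
        for name, value in heap_metrics(heap_graph).items():
            lo, hi = NORMAL_RANGES[name]
            if not (lo <= value <= hi):
                anomalies[name] = value
        return anomalies      # a non-empty result suggests a heap-based bug

    print(check({1: [2, 3], 2: [], 3: [], 4: []}))   # {'leaf_fraction': 0.75}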

    Cache-Conscious Data Structures - Design and Implementation
